Multi-armed bandits with censored consumption of resources
Abstract
We consider a resource-aware variant of the classical multi-armed bandit problem: in each round, the learner selects an arm and determines a resource limit. It then observes the corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret that incorporates both the actual amount of resources allocated in each learning round and the optimality of realizable rewards, as well as the risk of exceeding the allocated resource limit. Thus, to minimize regret, the learner needs to set the resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of its regret upper bound. In a simulation study, we show that our algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.
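The setting described in the abstract can be illustrated with a toy sketch: treat each (arm, resource limit) pair as a meta-arm and combine a UCB index on the realized (possibly censored) reward with a penalty on the allocated limit. All names, the index form, and the cost weight below are assumptions for illustration, not the paper's actual algorithm.

```python
import math

class CensoredUCB:
    """Toy UCB-style learner for the censored-resource setting (illustrative only).

    Each (arm, limit) pair is a meta-arm. Its index adds a UCB exploration
    bonus to the empirical mean of the realized reward (zero when the round
    was censored) and subtracts a penalty proportional to the allocated limit.
    """

    def __init__(self, n_arms, limits, cost_weight=0.1):
        self.counts = {(a, t): 0 for a in range(n_arms) for t in limits}
        self.means = {(a, t): 0.0 for a in range(n_arms) for t in limits}
        self.cost_weight = cost_weight  # hypothetical trade-off parameter
        self.round = 0

    def select(self):
        """Pick the (arm, limit) meta-arm with the highest index."""
        self.round += 1
        best_key, best_idx = None, -math.inf
        for key, n in self.counts.items():
            if n == 0:
                return key  # play each meta-arm once before using the index
            _, limit = key
            bonus = math.sqrt(2.0 * math.log(self.round) / n)
            idx = self.means[key] + bonus - self.cost_weight * limit
            if idx > best_idx:
                best_key, best_idx = key, idx
        return best_key

    def update(self, key, reward, censored):
        """Record a round; a censored round contributes zero realized reward."""
        realized = 0.0 if censored else reward
        self.counts[key] += 1
        self.means[key] += (realized - self.means[key]) / self.counts[key]
```

A higher `cost_weight` makes the learner prefer small resource limits even at the price of more censored rounds, mirroring the trade-off stated in the abstract.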
Similar Resources
Multi-Armed Bandits with Betting
In this paper we consider an extension where the gambler has, at each round, K coins available for play, and the slot machines accept bets. If the player bets m coins on a machine, then the machine will return m times the payoff of that round. It is important to note that betting m coins on a machine results in obtaining a single sample from the rewards distribution of that machine (multiplied ...
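The single-sample point in the snippet above can be made concrete with a short sketch (hypothetical Bernoulli payoff model, not from the paper): betting m coins draws one payoff sample and scales it by m, so the return's variance grows like m squared, unlike m independent unit bets, whose summed variance grows only linearly in m.

```python
import random

def bet(machine_mean, m, rng):
    """Bet m coins: draw ONE payoff sample and scale it by m.

    Illustrative Bernoulli payoff model. The return is either 0 or m,
    so Var = m**2 * p * (1 - p), versus m * p * (1 - p) for the sum of
    m independent unit bets.
    """
    payoff = 1.0 if rng.random() < machine_mean else 0.0
    return m * payoff

# A 3-coin bet on a fair machine returns only 0.0 or 3.0, never 1.0 or 2.0:
rng = random.Random(42)
returns = {bet(0.5, 3, rng) for _ in range(100)}
```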
Contextual Multi-Armed Bandits
We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions ...
Staged Multi-armed Bandits
In conventional multi-armed bandits (MAB) and other reinforcement learning methods, the learner sequentially chooses actions and obtains a reward (which can be possibly missing, delayed or erroneous) after each taken action. This reward is then used by the learner to improve its future decisions. However, in numerous applications, ranging from personalized patient treatment to personalized web-...
Mortal Multi-Armed Bandits
We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified ...
Regional Multi-Armed Bandits
We consider a variant of the classic multiarmed bandit problem where the expected reward of each arm is a function of an unknown parameter. The arms are divided into different groups, each of which has a common parameter. Therefore, when the player selects an arm at each time slot, information of other arms in the same group is also revealed. This regional bandit model naturally bridges the non...
Journal
Journal title: Machine Learning
Year: 2022
ISSN: 0885-6125, 1573-0565
DOI: https://doi.org/10.1007/s10994-022-06271-z